Abstract

A measure of indexing consistency is developed based on the concept 
of "fuzzy sets". It assigns a higher consistency value if indexers agree 
on the more important terms than if they agree on less important terms. 
Measures of the quality of an indexer's work and exhaustivity of indexing 
are also proposed. Experimental data on indexing consistency are presented
for certain categories of indexers, and consistency, quality, and exhaustivity
values are compared and analyzed. The analysis of indexing exhaustivity
leads to the conclusion that the increase of information as a result
of group indexing is a process analogous to Bradford's law of information
scattering, Lotka's law of scientific productivity, and Zipf's law
of vocabulary distribution.
INDEXING CONSISTENCY AND QUALITY

by

Pranas Zunde and Margaret E. Dexter *)

Definition of Indexing Consistency. It is well known that any two
indexers, indexing one and the same document individually, will select 
sets of indexing terms which are most unlikely to be identical. In 
other words, if one compares indexing terms assigned by any two indexers 
to the same document, one discovers that, as a rule, the indexers differ 
considerably in their judgment as to which terms reflect the contents 
of the document most adequately. It is clear that this difference in 
the judgment of indexers introduces a great deal of uncertainty in any 
information retrieval system based on human indexing. Although an information
system normally would not contain identical documents indexed by
different indexers, the implication is that documents which are similar
may and often would be indexed so differently that their similarity
would not be properly reflected in the sets of indexing terms assigned
to these documents. Various tools have been developed to reduce this
discrepancy in the judgment of the indexers, i.e. in their term assignments,
but the discussion of these tools is not the purpose of this article.

Since indexing consistency manifests itself in the similarity (or 
dissimilarity) of indexing terms assigned to a given document by differ- 
ent indexers, and since the selection of indexing terms by an indexer 
reflects his judgment regarding the information contained in the document
and its representation, indexing consistency is essentially a
measure of the similarity of reaction of different human beings processing
the same information. Thus, more precisely, we shall define
indexing consistency in a group of indexers as the degree of agreement
in the representation of the essential information content of the document
by certain sets of indexing terms selected individually and independently
by each of the indexers in the group.

*) Georgia Institute of Technology, School of Information Science.
This research was partially supported by the NSF Grant GN-655.

Previous Consistency Studies. In a survey of indexing consistency
studies, Hooper (1) cited consistency values ranging from 10% to 80%,
depending on the conditions under which the indexing was performed and 
the measure of consistency used. Jacoby (2) reported a mean consistency 
of 10% when chemical patents were indexed by three experienced and three 
inexperienced indexers, and no indexing aids were used. Consistency 
values of 35% to 45% were obtained by Slamecka and Jacoby (3) for experi- 
enced indexers using certain indexing aids, such as controlled vocabu- 
lary. 

In a third report, Jacoby and Slamecka (4) arrived at a consistency
of 16.3% for experienced indexers and 12.6% for inexperienced indexers.
No indexing aids other than indexing rules were used.

Painter (5) reported consistency values of 40%, 42%, 48% and 70% 
with varying indexing systems and types of documents. Rodgers (6) came
up with an average consistency of 24% for combinations of two indexers 
in a group. Korotkin and Oliver (7) reported consistencies ranging from 
36% to 59% in an experiment in which five psychologists and five non- 
psychologists indexed abstracts. 

Consistencies up to 80% were reached when indexers were required 
to use classification schedules and thesauri and had very limited freedom 
or no freedom at all to use terms not contained in the above indexing
aids (5, 7, 8, 9, 10). 

Schultz, Schultz and Orr (11) measured the goodness of author
indexing by using a set of indexers as a criterion group to establish
a weight for each term. The weight for a term for a given document was
defined as the square of the number of criterion group indexers selecting
the term for the specific document. The goodness of the author's index
set was taken to be the sum of the weights of the terms selected. This
measure was then normalized by dividing the score for each document by
the maximum score possible for the document, i.e. the sum of the weights of
all terms selected by the criterion group for the document. They found
that the mean score for the authors was 76%.
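
The scoring scheme just described can be illustrated with a minimal sketch. It follows only the verbal description given here; the function name `goodness_score` and the sample terms are hypothetical and are not taken from reference (11).

```python
def goodness_score(author_terms, criterion_sets):
    """Normalized goodness of an author's index set, as described in the text."""
    # Count how many criterion-group indexers selected each term.
    counts = {}
    for terms in criterion_sets:
        for term in set(terms):
            counts[term] = counts.get(term, 0) + 1
    # Weight of a term = square of the number of criterion indexers selecting it.
    weights = {term: n * n for term, n in counts.items()}
    # Maximum possible score = sum of the weights of all criterion-group terms.
    max_score = sum(weights.values())
    if max_score == 0:
        return 0.0
    # Terms the criterion group never chose contribute nothing.
    raw_score = sum(weights.get(term, 0) for term in set(author_terms))
    return raw_score / max_score


# Hypothetical data: three criterion indexers and one author.
criterion = [{"heart", "blood"}, {"heart", "muscle"}, {"heart", "blood", "liver"}]
author = {"heart", "blood", "enzyme"}
score = goodness_score(author, criterion)
print(round(score, 3))
```

In this made-up case the weights are 9, 4, 1 and 1, so the author's set scores 13 out of a possible 15.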

Consistency Measure Used in Previous Studies. The consistency
measure used in most previous studies was the ratio of the number of
terms selected by all indexers in the group to the total number of different
terms selected for the document. For a group of two indexers the
consistency measure on a given document is thus defined as

$$c_{12} = \frac{n(T_1 \cap T_2)}{n(T_1 \cup T_2)} \tag{1}$$

where $T_1$ and $T_2$ denote the sets of terms selected by the first and second
indexers respectively, and n(T) denotes the number of elements in
set T.

This measure can be extended to k > 2 indexers by taking

$$c_{12 \ldots k} = \frac{n(T_1 \cap T_2 \cap \cdots \cap T_k)}{n(T_1 \cup T_2 \cup \cdots \cup T_k)} \tag{2}$$
The main disadvantage of such a consistency measure is the underlying
assumption that all the indexing terms selected are equally significant
and relevant for the representation of the information content of the document.
In other words, measures Eq (1) and Eq (2) completely disregard the
difference between agreement on significant terms and agreement on insignificant
terms. The result, as will be shown in this paper, is that the above
measures tend to produce consistently lower consistency values than one
would intuitively be willing to accept. For example, consider three indexers
indexing one and the same document on jet propulsion and assume,
for simplicity, that each of them assigned three indexing terms to the document
as shown below:



Indexing Term        Indexer
                   1    2    3
jet                X    X    X
propulsion         X    X
flow               X         X
level                   X    X



Assume further that the terms JET and PROPULSION refer to the main 
topic of the document, whereas FLOW and LEVEL refer to topics of marginal 
importance. From Eq(l), the consistency of any two of these three indexers 
is 0.50 (or 50 percent). But in effect the indexing consistency of 1 and 2 
should be considered much higher than either that of 1 and 3 or of 2 and 3 
since the common part of sets $T_1$ and $T_2$ above contains both of the significant
words JET and PROPULSION. Therefore the measures Eq (1) or Eq (2) would
be fair measures of consistency only in the absence of any information
whatsoever on the relative importance of the indexing terms, but they
obviously do not adequately reflect the agreement of the indexers' judgment if
such information is available.
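
The arithmetic of this example can be checked with a short sketch of the unweighted measure, i.e. the number of terms common to all indexers divided by the number of distinct terms. The function name `unweighted_consistency` is an assumption made for illustration; the term assignments are those of the jet propulsion example.

```python
from itertools import combinations


def unweighted_consistency(*term_sets):
    """Ratio of terms chosen by every indexer to all distinct terms chosen."""
    union = set().union(*term_sets)
    if not union:
        return 0.0
    common = set(term_sets[0]).intersection(*term_sets[1:])
    return len(common) / len(union)


# Term assignments from the jet propulsion example.
indexers = {
    1: {"jet", "propulsion", "flow"},
    2: {"jet", "propulsion", "level"},
    3: {"jet", "flow", "level"},
}

# Consistency of every pair of indexers.
pairwise = {
    (a, b): unweighted_consistency(indexers[a], indexers[b])
    for a, b in combinations(sorted(indexers), 2)
}
# Consistency of the whole group of three.
group = unweighted_consistency(*indexers.values())
print(pairwise, group)
```

Every pair scores 0.50, although only indexers 1 and 2 agree on both significant terms, which is the shortcoming discussed above.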
Proposed Consistency Measure. The consistency measure which is
proposed here, and which is expected to eliminate, at least partially,
the shortcomings of the consistency measures of the type expressed by
Eq (1) and Eq (2), is based on the postulate that there exists no well
defined set of "relevant", "most indicative", "most pertinent", "most
informative", etc. indexing terms for a document, because there exist
no objective criteria which would enable us to construct such sets.
The indexing performance of human indexers demonstrates this clearly,
because if such criteria were available, we could apply them to obtain
100 percent indexing consistency.

An alternative to a well defined set is a "fuzzy set", which has
been proposed by Zadeh (12). Whereas well defined sets have precisely
stated criteria of membership, a fuzzy set is a collection of objects
which meet the criteria of membership to a varying degree which is
assumed as given. More precisely, a fuzzy set A in a set X of objects
x is characterized by a membership function $f_A(x)$ which associates with
each point in X a real number in the interval [0, 1], with the value of
$f_A(x)$ at x representing the "grade of membership" of x in A. The
nearer the value of $f_A(x)$ to unity, the higher the grade of membership
of x in A. If the function $f_A(x)$ can assume only the values 0 and 1,
it reduces to the familiar characteristic function of the set A. For
further discussion of fuzzy sets and operations on them, see the Appendix.

Consider now the set of all English words and phrases, which are 
potential indexing terms for any kind of document. We shall say that 
this set of words and phrases is a fuzzy set with respect to the member- 
ship criteria of "being representative of or pertinent to" a given 
document D, the degree of membership of each term reflecting the degree
of agreement as to its significance with respect to the information it
conveys about the document indexed. We shall call this set the global
(indexing) set U for the document D.

We shall further define the set $T_j$ of indexing terms t assigned by
an indexer $I_j$ to the document D to be the subset of the fuzzy set U such
that for each $t \in T_j$, $f_{T_j}(t) = f_U(t)$. In other words, the set $T_j$ is a
fuzzy set obtained by associating with each indexing term t selected by
the indexer $I_j$ the membership value which that term t has in the set U.
For example, the set $T_j = \{t_1, t_2, t_3, t_4\}$ = {blood, circulation, heart,
disease} with the membership function $f(t_1) = 0.8$, $f(t_2) = 0.85$,
$f(t_3) = 0.95$, $f(t_4) = 0.7$ might represent the set of indexing terms
assigned by an indexer $I_j$ to some document D.

Now let $\{I_1, I_2, \ldots, I_m\}$ be a group of indexers and $\{T_1, T_2, \ldots, T_m\}$
a (well defined) collection of fuzzy sets of indexing terms t assigned
by each of these indexers to one and the same document D. We shall
define the measure of consistency of the group of indexers $I_1, I_2, \ldots, I_m$
by the expression

$$c^{*}_{12 \ldots m} = \frac{\sum_t f_{T_1 \cap T_2 \cap \cdots \cap T_m}(t)}{\sum_t f_{T_1 \cup T_2 \cup \cdots \cup T_m}(t)} \tag{3}$$

where $\sum_t f_{T_1 \cap T_2 \cap \cdots \cap T_m}(t)$ denotes the sum of the
membership values of the intersection set of the fuzzy sets
$T_j$, j = 1, 2, ..., m, and $\sum_t f_{T_1 \cup T_2 \cup \cdots \cup T_m}(t)$
denotes the sum of the membership values of the union set of the fuzzy
sets $T_j$, j = 1, 2, ..., m.
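
A minimal sketch of this weighted measure follows. It reads the definition as the summed membership values of the terms chosen by every indexer, divided by the summed membership values of the terms chosen by at least one indexer. Since each set inherits its membership values from the global set, the fuzzy intersection and union reduce to the ordinary ones with those values attached. The function name and the membership values for the jet propulsion terms are hypothetical.

```python
def weighted_consistency(membership, *term_sets):
    """Membership-weighted agreement of a group of indexers on one document."""
    union = set().union(*term_sets)
    common = set(term_sets[0]).intersection(*term_sets[1:])
    # Sum of membership values over the union of the indexers' sets.
    denominator = sum(membership.get(t, 0.0) for t in union)
    if denominator == 0:
        return 0.0
    # Sum of membership values over the intersection of the indexers' sets.
    numerator = sum(membership.get(t, 0.0) for t in common)
    return numerator / denominator


# Hypothetical membership values: the main-topic terms weigh more.
membership = {"jet": 1.0, "propulsion": 0.9, "flow": 0.3, "level": 0.2}

t1 = {"jet", "propulsion", "flow"}
t2 = {"jet", "propulsion", "level"}
t3 = {"jet", "flow", "level"}

c12 = weighted_consistency(membership, t1, t2)
c13 = weighted_consistency(membership, t1, t3)
c23 = weighted_consistency(membership, t2, t3)
print(round(c12, 3), round(c13, 3), round(c23, 3))
```

Under these assumed weights the pair that agrees on both significant terms scores highest, whereas the unweighted ratio gives all three pairs the same value.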
Quality and Exhaustivity Measures. Furthermore we shall define
the measure of quality of an index to some document D to be the expression

$$\frac{\sum_t f_{T_j}(t)}{\sum_t f_U(t)} \tag{4}$$

where $\sum_t f_U(t)$ denotes the sum of the membership values of all the
elements t of the global set U, and $\sum_t f_{T_j}(t)$ denotes the sum of the
membership values of all the elements t of the fuzzy set of indexing
terms $T_j$ assigned by the indexer $I_j$ to the document D.

It should be noted that the measure of quality given by expression
(4) can be viewed as the measure of goodness of performance of indexer
$I_j$ in indexing document D, where by performance is understood the capability
of the indexer to select indexing terms which would give maximum
information about the document indexed.
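
The quality measure can be sketched as the share of the global set's total membership that one indexer's terms capture. In the example below the first four membership values are those of the blood/circulation/heart/disease illustration given earlier; the two additional global-set terms and the function name `quality` are hypothetical.

```python
def quality(global_membership, selected_terms):
    """Summed membership of an indexer's terms over that of the global set."""
    total = sum(global_membership.values())
    if total == 0:
        return 0.0
    captured = sum(global_membership.get(t, 0.0) for t in set(selected_terms))
    return captured / total


# Global set: four values from the text's example plus two assumed terms.
global_set = {
    "blood": 0.8,
    "circulation": 0.85,
    "heart": 0.95,
    "disease": 0.7,
    "artery": 0.4,     # hypothetical
    "pressure": 0.3,   # hypothetical
}

q = quality(global_set, ["blood", "circulation", "heart", "disease"])
print(round(q, 3))
```

An indexer who selected every term of the global set would score 1, so the measure rewards both choosing highly agreed terms and choosing enough of them.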

Finally we shall define the measure of exhaustivity by the expression

[Equation (5), with the auxiliary definitions (6) and (7); the expressions are illegible in the source.]



Essentially this measure indicates that portion of the overall information
of the document D which is reflected in the joint output of a
group of indexers $\{I_1, I_2, \ldots, I_m\}$.

Since $\gamma^{(m)} \geq \gamma^{(s)}$ whenever m > s, i.e. $\gamma$ increases or remains
unchanged when the number of indexers in the group increases, the change
in $\gamma$ can be considered as the indicator of the proportion of additional
information about the document gained due to the contribution of m - s
additional indexers.



A problem of practical importance is how to obtain the global set
U with its membership function for a document D. Leaving open the question
whether or not it is possible to determine such a set a priori,
we shall describe an experimental approach to the construction of such
a global set and development of the measures defined by Eq (3), Eq (4),
and Eq (5), which enabled us to demonstrate the major issues of this
investigation. The method used for the construction of the global set
will be outlined following the description of the experiments
performed and data used. (It is not claimed that this is the only
method to arrive at such a set.)

Experimental Data. The data used in this investigation came from
two sources. One source of data was a study performed by Schultz,
Schultz and Orr (11) in which 285 biomedical documents were indexed by
the author, by twelve biomedical scientists who were engaged in research
in the area, and by eight professional indexers. For the purposes of
this study, data from twenty-nine of these documents were used. Two
groups of eight indexers each were selected, the eight professional
indexers forming one group and the first eight of the twelve scientists
forming the other.

The documents indexed in this study were brief summaries of oral 
reports on current biomedical research which were presented at the 1962 
meeting of the Federation of American Societies for Experimental Biology.
These documents consisted of every tenth document in the 1962 meeting
issue of the Federation Proceedings (Vol. 21, No. 2, March-April 1962).
The twelve scientists who indexed the documents were given a form which 
listed 373 subject categories from which they could select as many terms 
as they wished. The form also provided space for writing in additional 
terms. The scientists were instructed to choose as many terms as they 
felt were required to characterize the document. They were also told 
to index personally, to use the form only when it was natural and helpful 
and to let their responses reflect their viewpoints and terminology. 

The eight professional indexers indexed the same documents using
the same form. They were also encouraged to assign to each document
as many terms as they considered necessary.

The other source of data was an experiment performed at Georgia 
Tech in May of 1968. In this experiment nine graduate students in the
School of Information Science each indexed sixteen documents. Eight of
these documents were selected from the 285 biomedical documents described
above. The other eight were selected from Efficient Reading, by James
E. Brown (13), a collection of articles designed for use in a rapid
reading course. Data from eight of the nine graduate students were used
for this study. No classification schedules or any other indexing aids
were made available to these students. No restrictions were imposed as
to the number of terms they were to assign to the documents.

Thus in the first instance we had slightly controlled indexing, in
the second, completely free indexing.

A sample document of the biomedical category is shown in Figure 1. 
Indexing terms which have been assigned to this document by student
indexers are given in Table 1. 

Figure 1. A sample document from the biomedical 
collection used in indexing experiment. 

PATHWAYS OF INTRACELLULAR HYDROGEN TRANSPORT. 

Bertram Sacktor, Arthur R. Dick* and Eva H. Worneer*

CRDL, Army Chemical Center, Maryland.

The mechanisms for oxidizing extramitochondrial DPNH in mammalian
tissues were studied. Activity measurements were made of the
pathways in different tissues, including: heart, skeletal muscle,
diaphragm, brain, lung, liver, kidney, spleen and testes, for
oxidizing exogenous DPNH by mitochondria and by cytoplasmic or
soluble lactic, α-glycerophosphate and malic dehydrogenases.

Analyses of the oxidized and reduced metabolites in the different 
tissues were correlated with the potential of the different pathways. 
Table 1. Terms Assigned by Student Indexers to
Sample Document Shown in Figure 1. 
Derivation of Fuzzy Sets. For each document D, the global fuzzy
set U was obtained by the following procedure. Every term assigned
by any one of the indexers $I_j$, j = 1, 2, ..., m, to the document D is
an element of the fuzzy set U. The membership function of the global
set U is obtained by assigning to each element, i.e. each term in the
set, a membership value equal to the ratio of the number of indexers
who assigned that term to the document D to the total number of
indexers who indexed that document, i.e. the total number of indexers in
the experimental group.
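
The procedure can be sketched in a few lines. The function names `global_fuzzy_set` and `indexer_fuzzy_set` are assumptions made for illustration, and the term assignments reuse the three-indexer jet propulsion example rather than experimental data.

```python
from collections import Counter


def global_fuzzy_set(term_sets):
    """Membership of a term = fraction of the group's indexers who assigned it."""
    m = len(term_sets)
    if m == 0:
        return {}
    # Each indexer counts at most once per term.
    counts = Counter(term for terms in term_sets for term in set(terms))
    return {term: n / m for term, n in counts.items()}


def indexer_fuzzy_set(global_set, terms):
    """An indexer's fuzzy set inherits its membership values from the global set."""
    return {term: global_set[term] for term in set(terms)}


# Term assignments from the jet propulsion example (three indexers).
assignments = [
    {"jet", "propulsion", "flow"},
    {"jet", "propulsion", "level"},
    {"jet", "flow", "level"},
]

U = global_fuzzy_set(assignments)
T1 = indexer_fuzzy_set(U, assignments[0])
print(sorted(U.items()))
```

Terms that no indexer used never enter the dictionary, which corresponds to a membership value of zero.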

The conceptual justification for this procedure is that the selection
of a term by an indexer is an indication that that term is representative,
at least in his judgment and to some degree, of the information
contained in the document and that it reflects some of that information.
Hence every term selected by an indexer in the group should be assigned
a membership value greater than zero. The rest of the terms in the
vocabulary, i.e. the terms which none of the indexers used, are assumed
to have the membership value equal to zero. As to the terms which have
been selected, the degree of consensus of indexers in selecting a term
is considered to be a proper indicator of its significance in representing
the information contained in the document. In other words, the
more indexers select a given indexing term, the more representative it
should be considered with respect to the contents of the document. A
term which is selected by all indexers in the group is assigned membership
value 1. Thus, the membership values of the elements (terms) of the
global fuzzy set U can assume values in the interval [0, 1], but terms
with membership values equal to zero are not shown, since they have no
effect on further calculations.

It should be noted that the fuzzy set U thus arrived at, as well
as the fuzzy sets $T_j$ of indexing terms selected by each indexer in
the experimental group, are "fuzzy" with respect to their pertinence or
significance in representing information contained in a document, but
they are not "fuzzy" with respect to the criterion of being symbols for
certain words, i.e. with respect to the partitioning of the vocabulary
into a set of terms which have been chosen as indexing terms and a set
of terms which have not been chosen as indexing terms. This is true
for any fuzzy set: the concept of "fuzziness" pertains to the criteria
of membership in a given set, and the same set could be a fuzzy set
with respect to one criterion and a well defined set with respect to
another.

As an example, the fuzzy global set U and three of the sets $T_j$ for
the document D in Figure 1, calculated from the data in Table 1, are
given below.
Results of Consistency Tests. Under all experimental conditions,
the proposed measure of consistency, which reflects the agreement of a 
group of indexers on the significance of the selected terms, produced 
on the average higher consistency values than the measure Eq (2) which 
does not reflect any judgment of significance of the terms. 

The consistency of one pair of professional indexers was calculated
for the twenty-nine biomedical documents. The resulting mean and variance
for both the "unweighted" measure $c_{ij}$ and the proposed "weighted" measure
$c^{*}_{ij}$ are given in Table 2.

Table 2. Mean and Variance of Consistency of One Pair
of Indexers over a Sample of 29 Documents for Consistency Measures
$c_{ij}$ and $c^{*}_{ij}$

            $c_{12} = n(T_1 \cap T_2) / n(T_1 \cup T_2)$    $c^{*}_{12} = \sum_t f_{T_1 \cap T_2}(t) / \sum_t f_{T_1 \cup T_2}(t)$
Mean                        .24                                              .41
Variance                    .01                                              .02

Consistency was then calculated for all pairs of indexers for a
specific document. The average consistency according to the measure $c_{ij}$
was 0.35; the average according to the proposed measure $c^{*}_{ij}$ was 0.59.
When random samples of pairs of indexers over all documents were calculated,
the two consistency values were found to be 0.27 and 0.44 respectively.

When both the "unweighted" and the proposed "weighted" values were
calculated for all pairs of professional indexers, it was found that the
measure $c_{ij}$ was higher in 2% of the cases, both measures were zero in
7½% of the cases, and the measure $c^{*}_{ij}$ was higher for the remaining 89½%
of the pairs of indexers.

The calculations were then extended to measure the consistency of
three indexers, four indexers, up to the consistency of the entire group
of eight indexers using Eq (2) and Eq (3), respectively.

Random combinations of three indexers, four indexers, and up to
eight indexers were selected from one document. Again, the proposed
measure $c^{*}$ tended to yield higher values of consistency than did the
measure c. The results are shown in Figure 2. Then the same calculations
were made for combinations of indexers selected randomly from
all documents. The results were quite similar to those for one document,
as may be seen in Figure 3.



Figure 2. Consistency of Indexing: Values
averaged over combinations of indexers
for one document

Figure 3. Consistency of Indexing: Values
averaged over combinations of indexers
and the sample set of documents
Quality Measurements. The indexing quality was calculated, using
Eq (4), for each indexer on each document for the professional indexers,
the scientists, and the students. The "best" and "worst" indexer in
each of the three groups was determined using the average quality for
the indexer.

The calculated quality measure for eight professional indexers and
eight scientists, based on a sample of 29 documents, and for eight student
indexers, based on a sample of 16 documents, is shown in Tables 3,
4, and 5. The documents were ranked for each group according to the
difference in quality of the best indexer and the worst indexer, and
the results plotted as shown in Figures 4, 5, and 6. It should be noted
that the indexer with the best average quality score scored higher in
performance quality for most documents individually than the indexer
with the lowest average quality score. As a matter of fact, there were only
two instances where the "poor" indexer of the group performed better*
than the best in the group by scoring higher on an individual document.
This is a strong indication that the proposed optimality measure is an
adequate tool in evaluating the overall indexer's performance.

The mean and variance of the quality score and the average number
of terms were calculated for all indexers. These data are shown in Table 6.
It is interesting to note that the quality measure increases with the
number of terms when the number of terms is small. But this increase
levels off at a certain point, and additional terms add little to the quality
measure. This seems to suggest an "optimal" number of terms which is,
on the average, slightly higher than the average number of terms selected
by one indexer.

*Two statistical tests were employed, a sign test and the Wilcoxon
matched-pairs signed-rank test. Under both tests, the difference in
quality between the best and the worst indexer is significant at the 
.001 level for all three groups of indexers. 



Figure 4. Quality Scores of the First Indexer
Group (Professionals) by Document
[Plot not reproduced; axis label: scores of the indexer ranking highest in overall performance.]

Figure 5. Quality Scores of the Second Indexer
Group (Scientists) by Document

Figure 6. Quality Scores of the Third Indexer Group
(Students) by Document
Table 3. Quality Scores by Indexers (Professional) and Documents

Table 3 (Continued)

                              Indexer
Doc. #   Rank     1      2      3      4      5      6      7      8
  21      22     .83    .79    .76    .52    .72    .72    .28    .52
  26      23     .23    .32    .47    .26    .44    .38    .53    .47
  17      24     .38    .38    .64    .21    .36    .49    .33    .51
  29      25     .30    .35    .43    .11    .24    .44    .46    .41
  22      26     .62    .44    [?]    .41    .50    .53    .47    .47
  20      27     .69    .32    .43    .19    .24    .27    .32    .46
  16      28     .53    .58    .42    .47    .42    .53    .53    .58
  27      29     .67    .54    .64    .62    .46    .41    .54    .62
Average          .52    .44    .50    .26    .53    .47    .44    .51
Table 4. Quality Scores by Indexers (Scientists) and Documents

                              Indexer
Doc. #   Rank     1      2      3      4      5      6      7      8
  13       1     .53    .13    .60    .60    .53    .67    .53    .47
  22       2     .45    .35    .70    .70    .35    .85    .60    .40
   5       3     .57    .21    .18    .43    .39    .64    .35    .21
  25       4     .84    .42    .42    .89    .53    .84    .63    .42
  24       5     .68    .37    .53    .26    .53    .79    .42    .37
  12       6     .35    .12    .41    [?]    .29    .53    [?]    .35
  26       7     .38    .38    .25    .56    .50    .75    .44    .38
  20       8     .67    .33    .44    .39    .56    .67    .61    .33
  10       9     .42    .26    .26    .47    .26    .58    .16    .47
  28      10     .64    .28    .40    .52    .48    .60    .56    .32
  23      11     .53    .16    .32    .84    .37    .47    .37    .26
   1      12     .63    .25    .29    .50    .63    .54    .08    .33
  15      13     .52    .24    .24    .48    .24    .53    .57    .29
  27      14     .50    .21    .33    .33    .58    .50    .50    .21
  16      15     .61    .44    .44    .94    .50    .72    .56    .44
  18      16     .24    .05    .36    .57    .24    .33    .38    .43
   2      17     .46    .04    .19    .38    .31    .31    .35    .19
   6      18     .48    .29    .43    .38    [?]    .53    .24    .38
   4      19     .41    .14    .23    .45    .41    .36    .14    .14
  14      20     .73    .47    .47    .60    .20    .68    .47    .47
  19      21     .64    .50    .50    .57    .57    .71    .14    .50
  11      22     .38    .19    .19    .31    .25    .36    .13    .19
   9      23     .28    .28    .33    .50    .11    .44    .11    .39
  21      24     .67    .28    .44    .72    .22    .44    .39    [?]
   7      25     .65    .30    .30    .50    .55    .95    .15    .30
  29      26     .80    .30    .35    .20    .50    .95    .90    .30
  17      27     .64    .50    .50    .79    .57    .64    .19    .50
   8      28     .31    .25    .25    .38    [?]    .31    .99    .56
  30      29     .58    .37    .68    .63    .68    .37    .63    .21
Average          .52    .28    .38    .53    .42    .55    .39    .36
Table 5. Quality Scores by Indexers (Students) and Documents

                              Indexer
Doc. #   Rank     1      2      3      4      5      6      7      8
  14       1     .51    .66    .57    .60    .36    .60    .30    .55
   9       2     .38    .67    .26    .38    .62    .38    .33    .55
  10       3     .08    .92    .64    .15    .54    .51    .62    .46
  12       4     .32    .70    .52    .32    .25    .52    .41    .50
   7       5     .72    .78    .44    .72    .61    .75    .50    .64
  15       6     .55    .61    .40    .55    .48    .64    .33    .55
  16       7     .59    .77    .64    .48    .56    .62    .49    .44
   6       8     .68    .61    .71    .77    .61    .77    .45    .71
   2       9     .22    .36    .33    .67    .33    .79    .21    .33
   4      10     .74    .58    .48    .84    .55    .71    .48    .61
  11      11     .35    .63    .58    .70    .63    .78    .53    [?]
   1      12     .55    .40    .48    .55    .45    .69    .31    .48
   8      13     .41    .41    .25    .56    .56    .47    .34    [?]
  13      14     .61    .61    .25    .50    .64    .37    .55    .61
   3      15     .61    .51    .90    .70    .63    .59    .51    .78
   5      16     .40    .38    .24    .45    .52    [?]    .38    .38
Average          .48    .60    .48    .56    .52    .60    .43    .55
Table 6. Average Number of Indexing Terms Assigned,
Mean Quality Scores and Variances
for Professional Indexers, Scientists, and Student Indexers

        Professional Indexers        Scientists                   Students
        No. of    Quality            No. of    Quality            No. of    Quality
Rank    Terms     μ      σ²          Terms     μ      σ²          Terms     μ      σ²
  1      7.6     .53    .02          3.21     .55    .02           6.3     .60    .02
  2      6.7     .52    .03          3.72     .53    .03           6.4     .60    .02
  3      5.4     .51    .01          3.38     .52    .02           5.0     .56    .03
  4      6.9     .50    .03          2.54     .42    .02           4.4     .55    .02
  5      5.9     .47    .03          2.24     .39    .03           3.2     .43    .01
  6      4.4     .44    .04          1.76     .38    .02           4.9     .48    .03
  7      5.0     .44    .03          1.48     .36    .02           4.5     .48    .03
  8      2.5     .26    .01          1.0      .28    .01           3.2     .43    .01
 AVG     5.6                         2.4                           5.0
Measurement of Indexing Exhaustivity. The question to be discussed
next is how much an index can be improved by "group" indexing, i.e. by
having one and the same document indexed by several indexers and then
combining the results into a single index for that document. In other
words, the problem under investigation was how much more information is
contained in the joint product of two indexers as compared to one indexer,
or in the joint product of three indexers as compared to two, etc.

One approach to answering this question is to consider the number 
of new terms contributed by another indexer. Calculations were made as 
to the average number of terms selected by one indexer, two indexers, 
and so on, out of the total set of terms selected by all eight indexers 
by considering all combinations of indexers two at a time, three at a 
time, etc. These values were then expressed in percentages of the total 
number of terms resulting from all eight indexers. This data is shown 
in Table 7. Although the numbers of terms were quite different, the percentages
obtained were practically identical for the scientists and professional
indexers, and not greatly different for the students. In all
cases, approximately half the terms obtained by eight indexers were 
obtained by two indexers, and 80% of the terms were obtained by five. 
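
The combination-averaging procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name coverage_by_group_size and the toy term sets are hypothetical and are not taken from the experiment, and the sketch counts terms without any weighting by significance.

```python
from itertools import combinations


def coverage_by_group_size(term_sets):
    """Average percentage of the pooled vocabulary recovered by groups of
    1, 2, ..., N indexers, averaged over all combinations of that size."""
    total = len(set().union(*term_sets))  # terms selected by all indexers together
    result = {}
    for k in range(1, len(term_sets) + 1):
        # Size of the combined term set for every group of k indexers.
        sizes = [len(set().union(*group)) for group in combinations(term_sets, k)]
        result[k] = 100.0 * sum(sizes) / (len(sizes) * total)
    return result


# Hypothetical toy data: three indexers, terms invented for illustration only.
indexers = [
    {"indexing", "consistency", "fuzzy sets"},
    {"indexing", "retrieval"},
    {"consistency", "retrieval", "exhaustivity"},
]
coverage = coverage_by_group_size(indexers)
for k, pct in coverage.items():
    print(f"{k} indexer(s): {pct:.1f}% of all terms")
```

With eight indexers the loop visits every one of the 2^8 - 1 non-empty groups, which is small enough to enumerate exhaustively.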

The calculations were repeated using the proposed measure of
exhaustivity, Eq (4), which reflects the varying degrees of significance
of the information content of the assigned terms. Average exhaustivity
ratios were calculated for groups of 1, 2, 3, ..., 8 indexers, and the
results are shown in Table 7, columns 3, 5, and 7. The plot of these
exhaustivity ratios on a logarithmic scale is shown in Figure 7.
Table 7. Percentages of the Total Number of Terms
and Exhaustivity Levels as a Function
of the Number of Indexers in the Group

              Professional Indexers       Scientists                  Students
Number of     Percentage                  Percentage                  Percentage
Indexers      of Terms       100·y        of Terms       100·y        of Terms       100·y

    1         [illegible]    39 %         28 %           36 %         33 %           45 %
    2         46 %           57 %         46 %           55 %         52 %           63 %
    3         59 %           70 %         60 %           69 %         64 %           74 %
    4         70 %           79 %         71 %           78 %         72 %           82 %
    5         79 %           86 %         80 %           86 %         81 %           88 %
    6         87 %           91 %         88 %           92 %         88 %           93 %
    7         94 %           96 %         94 %           96 %         94 %           97 %
    8         100 %          100 %        100 %          100 %        100 %          100 %


Figure 7. Plot of the Exhaustivity Ratios y Versus the Number
of Indexers in the Group (on a log scale)
for a Group of 8 Indexers
[Axes: exhaustivity ratio versus number of indexers]



The interesting result of this experiment is that the exhaustivity
ratios thus obtained, when plotted against the number of indexers on a
logarithmic scale, lie on a straight line for each group of indexers
considered. This suggests a general model of indexing exhaustivity

$$ y = y_1 + m \log n, \qquad N \ll T \tag{8} $$

where y is the exhaustivity measure, $y_1$ is the "average exhaustivity"
of one indexer (the intercept of the line at n = 1, where log n = 0), n
is the number of indexers, $m = \tan \varphi$ is the tangent of the angle
at which the line intersects the n axis, N is the total number of
indexers in the group, and T is the total number of different indexing
terms used by the group.

To test this model, another calculation was performed independently
for a group of 20 indexers. The results are shown in the plot of Fig. 8
and can be seen to agree well with the previous findings. However, as
N approaches T, the dependence of the exhaustivity coefficient y on the
number of indexers n might be expected to deviate more and more from
Eq (8), because, as is intuitively clear, the distribution law underlying
this process would not hold unless $N \ll T$.
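
As a rough check of the straight-line model of Eq (8), one can fit it by ordinary least squares to the 100·y column for professional indexers in Table 7. The sketch below is illustrative only: the function name fit_log_model is hypothetical, the use of least squares is an assumption (the text does not say how the line was drawn), and the logarithm is taken to base 10.

```python
import math


def fit_log_model(ratios):
    """Least-squares fit of y = y1 + m * log10(n) for n = 1..len(ratios)."""
    xs = [math.log10(n) for n in range(1, len(ratios) + 1)]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ratios) / len(ratios)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ratios))
    m = sxy / sxx                # slope of the line
    y1 = y_mean - m * x_mean     # value of the fitted line at n = 1
    return y1, m


# 100*y for professional indexers, n = 1..8, as read from Table 7.
professional = [39, 57, 70, 79, 86, 91, 96, 100]
y1, m = fit_log_model(professional)
residuals = [
    y - (y1 + m * math.log10(n)) for n, y in enumerate(professional, start=1)
]
print(f"y1 = {y1:.1f}, m = {m:.1f}")
print("largest deviation from the line:", round(max(abs(r) for r in residuals), 2))
```

Small residuals relative to the range of the data are what the straight-line appearance of Figure 7 would lead one to expect.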



Figure 8. Plot of the Exhaustivity Ratios y Versus the Number
of Indexers in the Group (on a log scale)
for a Group of 20 Indexers

Since

$$ y(n+1) = y_1 + m \log (n+1) $$

the difference, or rank increment, of y

$$ \Delta y_n = y(n+1) - y(n) = m \log (n+1) - m \log n $$

can be approximately equated to

$$ \Delta y_n = m \log \left( 1 + \frac{1}{n} \right) \approx \frac{m}{\ln 10} \cdot \frac{1}{n} = \frac{a}{n} $$

or

$$ \Delta y_n \cdot n = a = \text{const.} \tag{9} $$
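
The step from the exact increment m log(1 + 1/n) to the form a/n rests on the first-order approximation ln(1 + x) ≈ x. A small numerical sketch, assuming base-10 logarithms and an arbitrary slope m = 1 chosen purely for illustration (the function names are hypothetical), shows how close the two expressions are for different n:

```python
import math


def exact_increment(m, n):
    """Exact rank increment m*log10(n+1) - m*log10(n)."""
    return m * (math.log10(n + 1) - math.log10(n))


def approx_increment(m, n):
    """First-order approximation a/n, with a = m / ln 10."""
    return m / (math.log(10) * n)


m = 1.0  # arbitrary slope, for illustration only
rows = [(n, exact_increment(m, n), approx_increment(m, n)) for n in (1, 2, 5, 10, 20)]
for n, exact, approx in rows:
    print(f"n={n:2d}  exact={exact:.4f}  approx={approx:.4f}  n*exact={n * exact:.4f}")
```

Because ln(1 + x) < x for x > 0, the approximation always overestimates the increment; it is coarse for n = 1 and tightens as n grows, so the product of the increment and n is only approximately constant for small groups.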

Thus Eq (8) and its equivalent form Eq (9) are familiar expressions
obtained in other contexts: with appropriate changes of the names of
variables, they can be interpreted either as Bradford's law of
information scattering (14), as the law of scientific productivity
developed by Lotka (15), or as Zipf's law of vocabulary distribution (16).
In Bradford's law, the expression of the form of Eq (8) relates the
cumulative total number of articles to the number of journals in which
they were published. Lotka obtained an equivalent functional relationship
(although in a more general form) between the number of scientists and the
number of scientific contributions they make. In the case of Zipf, an
expression of the type of Eq (9) was obtained relating word length to the
number of word occurrences. We have shown that the same model also
expresses the relation between the level of exhaustivity of information
representation, or indexing, and the number of decision makers (or
indexers) whose individual judgments are combined to reproduce the
information content of the object which is jointly observed (specifically,
to produce a cumulative index). In other words, this could be considered
as the law expressing the dependence of the "completeness" of knowledge
on the size of the community which might be expected to contribute to
this knowledge.

The fact that the same model is applicable to the process of
information scattering (Bradford's Law), growth of scientific productivity
(Lotka's Law), vocabulary distribution (Zipf's Law), and, as we have
now shown, to the completeness or exhaustivity of information
representation as reflected in group indexing, implies that the same basic
law of distribution of information flow should underlie all these processes.
This topic will be discussed in greater detail in a different paper.

Conclusions. Measures of indexing consistency should reflect not
only the formal agreement of indexers on a number of terms, but also the
significance of the terms on which the indexers agree or disagree. This
can be achieved if the sets of indexing terms are considered not as
well-defined sets but as fuzzy sets with respect to the significance
judgment. A procedure for designing such sets has been proposed, and it
has been shown that the proposed approach can be extended to define the
quality and exhaustivity of indexing. The experimental results on indexing
exhaustivity based on this approach warrant the conclusion that the
increase of information as a result of group indexing obeys the same law
as that of the scattering of information (Bradford), of scientific
productivity (Lotka), and of vocabulary distribution (Zipf).
APPENDIX

Operations on Fuzzy Sets. The purpose of this appendix is to
expand the concept of fuzzy sets discussed in the text. The definitions
are from Zadeh (12).

Let A, B and C be fuzzy sets in a set X of objects x with membership
functions $f_A(x)$, $f_B(x)$ and $f_C(x)$ respectively. Then the following
definitions are made:

Equality: The fuzzy sets A and B are said to be equal, written
$A = B$, if $f_A(x) = f_B(x)$ for all x in X.

Subsets: The fuzzy set A is said to be a subset of the fuzzy set
B if $f_A(x) \le f_B(x)$ for all x in X.

Union: The union of two fuzzy sets A and B, written $A \cup B$, is a
fuzzy set C where $f_C(x) = \max [f_A(x), f_B(x)]$, $x \in X$, which may be
abbreviated $f_C = f_A \vee f_B$. The union of n fuzzy sets
$T_1, T_2, \ldots, T_n$ is the fuzzy set C where

$$ f_C = f_{T_1} \vee f_{T_2} \vee \ldots \vee f_{T_n} = \bigvee_i f_{T_i} $$

Intersection: The intersection of two fuzzy sets A and B, written
$A \cap B$, is a fuzzy set C where $f_C(x) = \min [f_A(x), f_B(x)]$,
$x \in X$, which may be abbreviated $f_C = f_A \wedge f_B$. The intersection
of n fuzzy sets $T_1, T_2, \ldots, T_n$ is the fuzzy set C where

$$ f_C = f_{T_1} \wedge f_{T_2} \wedge \ldots \wedge f_{T_n} = \bigwedge_i f_{T_i} $$

It can be shown that fuzzy sets have the following properties:

$$ A \cup B = B \cup A \qquad A \cap B = B \cap A $$

$$ (A \cup B) \cup C = A \cup (B \cup C) \qquad (A \cap B) \cap C = A \cap (B \cap C) $$

$$ A \cap (B \cup C) = (A \cap B) \cup (A \cap C) \qquad A \cup (B \cap C) = (A \cup B) \cap (A \cup C) $$

$$ A \cup A = A $$

$$ A \cap A = A $$
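
The operations defined above are easy to try out numerically. The following is a minimal sketch, assuming a fuzzy set is represented as a Python dictionary mapping each object x to its membership grade, with objects absent from the dictionary taken to have grade 0. The function names and the term labels t1, t2, t3 are hypothetical and chosen only for illustration.

```python
def fuzzy_union(a, b):
    """Union: pointwise maximum of the membership functions."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}


def fuzzy_intersection(a, b):
    """Intersection: pointwise minimum of the membership functions."""
    return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}


def is_subset(a, b):
    """A is a subset of B if f_A(x) <= f_B(x) for every x."""
    return all(grade <= b.get(x, 0.0) for x, grade in a.items())


# Hypothetical membership grades of three terms in three fuzzy sets.
A = {"t1": 0.9, "t2": 0.4, "t3": 0.0}
B = {"t1": 0.5, "t2": 0.7, "t3": 0.2}
C = {"t1": 0.1, "t2": 1.0, "t3": 0.6}

print(fuzzy_union(A, B))
print(fuzzy_intersection(A, B))
# One of the distributive properties listed above, checked on this example.
print(
    fuzzy_intersection(A, fuzzy_union(B, C))
    == fuzzy_union(fuzzy_intersection(A, B), fuzzy_intersection(A, C))
)
```

Checking the properties on one example does not prove them, but it makes the max/min definitions concrete.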